Improved documentation for configuring dataset parameters in the data catalog #3969
Conversation
Signed-off-by: Elena Khaustova <[email protected]>
docs/source/data/data_catalog.md
Outdated
2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
The rest of the keys are dataset properties and vary depending on the implementation.
To get the extensive list of dataset properties, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
3. Some dataset properties can be further configured depending on the libraries underlying the dataset implementation.
What's an example of this, and how is this different from 2.? Where can users find information about this?
The example is in the following line; I've extended it a bit for clarity. The difference is that some of the parameters require referring to the underlying library methods to get the full set of accepted parameters. This is not clear to some users, so we wanted to show it explicitly in the docs.
docs/source/data/data_catalog.md
Outdated
The dataset configuration in `catalog.yml` is defined as follows:
1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
The rest of the keys are dataset properties and vary depending on the implementation.
I think it would be cool to do something like:

> **Important**
> Kedro datasets make every intention to not make any assumptions and delegate any of the `load_args` / `save_args` directly to the underlying implementation.
small suggestions but LGTM! 🚀
docs/source/data/data_catalog.md
Outdated
@@ -36,20 +36,24 @@ shuttles:
The dataset configuration in `catalog.yml` is defined as follows:
1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
Suggested change:
- 2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
+ 2. The next level includes multiple keys. The first one is the mandatory key, `type`, which defines the type of dataset to use.
docs/source/data/data_catalog.md
Outdated
The rest of the keys are dataset parameters and vary depending on the implementation.
To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
3. Some dataset parameters can be further configured depending on the libraries underlying the dataset implementation.
In the example below, the configuration of the `load_args` parameter is defined by the `pandas` option for loading CSV files, while the configuration of the `save_args` parameter is defined by the `snowpark` `saveAsTable` method.
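A rough sketch of the catalog entries under discussion — dataset types are modeled on `kedro-datasets`, while file paths and option values are illustrative, not taken from the PR:

```yaml
shuttles:
  type: pandas.CSVDataset
  filepath: data/01_raw/shuttles.csv
  load_args:            # forwarded to pandas.read_csv
    sep: ","
    skiprows: 0

weather:
  type: snowflake.SnowparkTableDataset
  table_name: weather
  save_args:            # forwarded to Snowpark's saveAsTable
    mode: overwrite
```

Here `load_args` takes whatever `pandas.read_csv` accepts, and `save_args` takes whatever Snowpark's `saveAsTable` accepts, which is exactly the point of item 3.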
nit: I would also add the dataset names here, like `shuttles` when mentioning `load_args` and `weather` for the `save_args` example, and break this into two sentences, as shorter sentences are easier to read.
I left some small grammatical suggestions, but otherwise looks all good 👍
docs/source/data/data_catalog.md
Outdated
### Configuring dataset parameters in `catalog.yml`

The dataset configuration in `catalog.yml` is defined as follows:
1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
Suggested change:
- 1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
+ 1. The top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
docs/source/data/data_catalog.md
Outdated
1. The Top-level key is the dataset name used as a dataset identifier in the catalog - `shuttles`, `weather` in the example below.
2. The next level includes multiple keys. The first one is the mandatory key, `type,` which defines the type of dataset to use.
The rest of the keys are dataset parameters and vary depending on the implementation.
To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
Suggested change:
- To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
+ To get the extensive list of dataset parameters, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
docs/source/data/data_catalog.md
Outdated
To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the `__init__` method of the target dataset.
3. Some dataset parameters can be further configured depending on the libraries underlying the dataset implementation.
In the example below, the configuration of the `load_args` parameter is defined by the `pandas` option for loading CSV files, while the configuration of the `save_args` parameter is defined by the `snowpark` `saveAsTable` method.
To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
Suggested change:
- To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
+ To get the extensive list of dataset parameters, see {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
docs/source/data/data_catalog.md
Outdated
3. Some dataset parameters can be further configured depending on the libraries underlying the dataset implementation.
In the example below, the configuration of the `load_args` parameter is defined by the `pandas` option for loading CSV files, while the configuration of the `save_args` parameter is defined by the `snowpark` `saveAsTable` method.
To get the extensive list of dataset parameters, refer to {py:mod}`The kedro-datasets package documentation <kedro-datasets:kedro_datasets>` and navigate to the target parameter in the `__init__` definition for the dataset.
For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you may find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.
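As a concrete sketch of the `pandas.ExcelDataset` case mentioned above (the file path and option values are illustrative), any keyword accepted by `pandas.read_excel` can be placed under `load_args`:

```yaml
shuttles:
  type: pandas.ExcelDataset
  filepath: data/01_raw/shuttles.xlsx
  load_args:            # any pandas.read_excel keyword argument
    sheet_name: 0
    engine: openpyxl
```

This is why the dataset docs point at `pandas.read_excel` rather than re-listing its options: the accepted keys are defined by pandas, not by the dataset.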
Suggested change:
- For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you may find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.
+ For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you can find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.
docs/source/data/data_catalog.md
Outdated
For those parameters we provide a reference to the underlying library configuration parameters. For example, under the `load_args` parameter section for [pandas.ExcelDataset](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.pandas.ExcelDataset.html) you may find a reference to the [pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) method defining the full set of the parameters accepted.

```{note}
Kedro datasets make every intention to not make any assumptions and delegate any of the `load_args` / `save_args` directly to the underlying implementation.
```
Suggested change:
- Kedro datasets make every intention to not make any assumptions and delegate any of the `load_args` / `save_args` directly to the underlying implementation.
+ Kedro datasets delegate any of the `load_args` / `save_args` directly to the underlying implementation.
LGTM! 👍
Description
Solves #3919
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance. If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- `RELEASE.md` file